November 2025

Introduction

The Challenge

  • Generative AI tools in low- and middle-income countries are multiplying

    • AI-powered math tutors for children
    • Digital advisory tools for farmers
    • Health assistants
  • Problem: While some studies show effectiveness (e.g., Henkel et al., 2024), others show AI applications exhibit unexpected and unwanted behavior that can be harmful to users (e.g., Bastani et al., 2024)

  • Gap: While there is broad consensus on the importance of evaluating GenAI in the development sector, there has been little agreement on what this actually means

  • Consequence: In the absence of clear standards, organizations have adopted very different evaluation approaches

The Problem: Fragmented Approaches

  • Tech-focused organizations: Emphasize model/product performance, neglect development outcomes

  • Development-focused organizations: Default to RCTs, ignore model and product evaluations

  • Funders: Lack clarity on what evaluations to expect or what “right-sized” evaluation entails

  • Reality: All methods are complementary and should be used together at different stages

The Framework

The Solution: Four-Level Framework

This playbook organizes evaluation around four levels:

  1. Level 1 – Model evaluation: Does the AI model produce the desired responses?

  2. Level 2 – Product evaluation: Does the product facilitate meaningful interactions?

  3. Level 3 – User evaluation: Does the product positively support users’ thoughts, feelings, knowledge, and behaviors?

  4. Level 4 – Impact evaluation: Does access to the product improve development outcomes?

Continuous Evaluation: The Central Element

  • Unlike earlier rule-based digital tools, GenAI’s unique sensitivity to the underlying model, architecture, data, and prompts demands new evaluation methods

  • The underlying components can evolve far faster than in earlier digital technologies - with new AI models and technologies being released weekly

  • Developers must ensure their applications perform as intended over time, even as updates are released

  • Continuous evaluation thus becomes essential, enabling developers to:

    • Iterate quickly
    • Maintain expected behavior
    • Steadily improve performance and impact
  • This focus on continuous evaluation, while commonplace in software companies, might be less familiar in the development sector where programs are often judged by one-off experiments

The Development Sector Challenge

  • Development sector: Programs often judged by one-off experiments (evaluation as finish line)

  • Our approach: Rapid, ongoing cycles where deployment, adaptation, evaluation, and improvement happen in tandem

  • This is commonplace in software companies, but less familiar in development sector

Framework Origins

Natural Progression of Evaluation Levels

  1. Level 1: Build AI model pipeline, benchmark performance
  2. Level 2: Test product for usability and user engagement
  3. Level 3: Understand effects on thoughts, feelings, knowledge, behaviors
  4. Level 4: Assess long-term improvements in development outcomes

Tech sector: Typically stops at Levels 1-2 (engagement predicts success)

Development sector: Higher bar - does it improve lives in meaningful, measurable, cost-effective ways?

Who Should Use This Guide?

  • AI Engineers → Model behavior (Level 1)
  • Product Managers & Data Scientists → Product analytics (Level 2)
  • Psychologists & Behavioral Researchers → User thoughts, feelings, behaviors (Level 3)
  • Impact Evaluators → Social impact (Level 4)

Key: All actors must see beyond their slice of the evaluation process

Building Blocks for GenAI Evaluation

Four concrete, actionable steps that move evaluation from theory into practice:

  1. Construct a clear user funnel across Levels 1-4
  2. Build and track metrics through robust ETL pipelines
  3. Diagnose weak links through targeted hypotheses
  4. Run experiments with rigor and speed

Building Block 1: User Funnel

A user funnel is a structured way to map how individuals move through your product or program, from first exposure to long-term life impact. A comprehensive funnel creates a shared framework for tracking a user’s experience through a journey.

To build a robust funnel, teams should begin by defining the final development outcome they’re targeting (Level 4) - for instance, improved learning outcomes, better health, or increased crop yields. From there, work backward to break down the journey into specific user stages.

User Funnel: The 6 Stages

Stage Description Evaluation Level
1. Recruitment Beneficiary identified and enters program Level 2
2. Onboarding User introduced to AI product and completes setup Level 2
3. Engagement User begins actively interacting with AI product Level 2
4. Retention User continues engaging over time (not dropping off) Level 2
5. Proximal Outcome Near-term cognitive or behavioral change Level 3
6. Development Outcome Long-term desired result achieved Level 4

User Funnel: What to Define for Each Stage

For each stage, teams should clearly define:

Element Description
What program does Actions to bring users into that stage
What user must do User actions that count as entering the stage
Metric Measurement that confirms entry (e.g., login rate, session length)
Target values Target metric values and transition rates between stages
Costs Costs associated with moving a user through the stage
DRIs Directly Responsible Individuals for performance and metrics

Building Block 2: ETL Pipelines

A well-designed evaluation framework is only as good as the data infrastructure that supports it. At the heart of that infrastructure is a robust ETL pipeline - a system that extracts, transforms, and loads data to power consistent, reliable measurement of program indicators.

  • Extract: Collect data from various sources - chat logs, product telemetry, survey tools, third-party APIs, or spreadsheets

  • Transform: Clean, standardize, and reshape the raw data into a usable format. This could involve timestamp alignment, anonymization, session stitching, or deriving new metrics like time-on-task indicators

  • Load: Store the transformed data in a centralized system (like a data warehouse or analytics dashboard) where teams can access it for analysis, visualization, or modeling

  • Why critical: AI products, especially those using generative models, produce high volumes of complex, often unstructured data: prompts, outputs, clicks, feedback, engagement patterns, and more. Without a clear ETL pipeline, turning raw data into actionable metrics at scale becomes unreliable and slow.

  • Example: A product designed to support adolescent mental health might collect model-level outputs (Level 1), engagement logs (Level 2), behavioral indicators (Level 3), and outcome data (Level 4) - all requiring integration through a robust pipeline.

Building Block 3: Targeted Hypotheses

Once a user funnel is in place and metrics are flowing through a robust ETL pipeline, the next challenge is understanding why certain funnel metrics are underperforming.

Process:

  1. Identify drop-offs: Start by identifying major user drop-offs along the funnel

  2. Develop hypotheses: Pose specific, testable questions: Why are users stalling? What mechanism explains this?

  3. Surface competing hypotheses: For example, if engagement dips after onboarding: Is value proposition unclear? Are users overwhelmed? Do they mistrust the AI?

  4. Test hypotheses: Each hypothesis becomes a lens for focused measurement or experiments

Goal: Make evaluation generative - helps teams ask better questions, faster. This approach sits at the intersection of product management, UX research, and behavioral science.

Building Block 4: Experiments

Test hypotheses through experimentation:

  • A/B tests: For lightweight product tweaks (prompts, messages)
  • Hold-out tests: For more complex behavioral shifts
  • Pragmatic RCTs: For policy-relevant questions
  • Full RCTs: When causal question justifies cost

Key: Match experimentation to product maturity, hypothesis scale, and decision stakes. Tools like Evidential help teams automate randomization and track results.

Building Cross-Functional Teams

Evaluation is a team sport - no single role covers all four levels.

Level Lead Roles Support Roles
Level 1 AI Engineers, ML Researchers Domain Experts, Product Owners
Level 2 Product Managers Data Scientists, Data Engineers
Level 3 Psychologists, UX Researchers Data Scientists
Level 4 Policy Researchers, Economists AI Engineers

Best Practices for Cross-Functional Teams

  1. Look Beyond Your Slices: Understand how your work affects other levels
  2. Pair Engineers with Domain Experts Early: Involve domain experts at Level 1 from start
  3. Identify Cross-Functional Lead: Product Managers connect dots across roles
  4. Use Shared Evaluation Language: Common vocabulary using levels
  5. Use Tools that Support Collaboration: Evaluation platforms, dashboards, experimentation tools

Level 1: Model Evaluation

Level 1: Model Evaluation

Question: Does the AI model produce the desired responses?

Why important: AI models, especially large language models (LLMs) and related foundational models, do not “understand” content in the way humans do. Instead, they generate outputs by predicting the next word in a sequence based on statistical patterns in their training data. Because of this, models can hallucinate or appear fluent and convincing while still being inaccurate, biased, irrelevant, or even harmful.

This makes structured model evaluation essential. We need to systematically and rigorously assess whether an AI system consistently meets conditions such as usefulness, accuracy, appropriateness, and safety across diverse tasks and user contexts. This is especially critical when AI tools are deployed in sensitive domains like education, health, or agriculture, where misinformation or misalignment can cause real harm.

Beyond ensuring safety, developers must also evaluate that their AI systems exhibit desirable behaviors and characteristics proven to have a real-world impact. For instance, an AI tutor should follow pedagogical best practices - such as withholding answers to encourage self-directed learning and accurately gauging a student’s level to tailor instruction.

What is Being Assessed?

Most Generative AI applications are built on foundational models like those from OpenAI (GPT), Anthropic (Claude), Google (Gemini), or Meta (Llama). However, your application is a full system, not just the foundational model. It includes many other components that can be grouped into three buckets:

  • Pre-processing: Before handing off the input from the user to the LLM, you may wish to transform it into a format suitable for the LLM. Examples include: sanitizing or filtering language; converting speech to text; paraphrasing the user’s request; translation from a low-resource language to a high-resource one.

  • LLM context preparation: An LLM takes three things as input: the “prompt” or system instructions, the user’s input after being pre-processed, and a “context” which can include past conversation history, relevant content retrieved from your knowledge base, or even tools available for the LLM to call.

  • Post-processing: Before returning the output to the user, you may also wish to transform it into the correct format and check the output using safety or quality guardrails. Examples include: hallucination checker, converting text to speech, translation to the user’s preferred language.

Example: An AI agronomist in Senegal, answering questions from farmers in Pulaar, might: (a) check input for malicious intent, (b) translate from Pulaar to English, (c) retrieve relevant content from database, (d) retrieve information about the farmer, (e) generate an answer, (f) check that the answer is grounded, (g) translate back to Pulaar. Model evaluations cover this entire pipeline.

Who is Most Involved?

Role Responsibility
AI Engineers, ML Researchers Execute - Lead model evaluation process
Domain Experts, Product Owners Support - Define evaluation rubrics

How to Evaluate: 6 Steps

  1. Decide on rubric: What characteristics should the solution embody?
  2. Decide on metrics: How to measure performance against rubric
  3. Develop golden dataset: Representative user interactions
  4. Run automated evaluations: Integrate into CI/CD pipeline
  5. Measure scores and refine: Use scores to improve solution
  6. Red-teaming: Actively try to break/pressure test before release

Step 1: Decide on Rubric

Responsible: Product Owners and Domain Experts (with Engineering support)

Question: “What characteristics should our AI solution embody?” These are qualitative goals (e.g., “Trustworthy”, “On-Brand”, “Concise”). Most of the rubric will be determined by your use case, context, and impact goals. This step requires reflection and discussion with stakeholders - it is critical and guides the rest of your evaluation steps.

Step 1: Rubric Examples

Organization Product Rubric Items
Jacaranda Health PROMPTS: Maternal health SMS service (Swahili/English) Medical accuracy, personability, simplicity (Stanford Center for Digital Health, 2025)
Digital Green Farmer.Chat: Agricultural advice platform (40+ crops, 4 countries) Faithfulness, relevance, accessibility (Singh et al., 2024)

Recommendation: Restrict to ~5 items. Longer lists = more expensive and difficult. Tradeoffs exist (e.g., concise vs. complete, friendly vs. direct).

Step 2: Decide on Metrics

Responsible: Engineering (with Product Owner validation)

Engineering translates qualitative rubric into quantitative metrics (e.g., “Trustworthy” → “Factual Consistency Score”). Product Owner validates that technical metrics are acceptable proxies for business goals.

Terminology: Rubric item → Metric → Scorer → Score
Example: “helpful” → “answer relevance” → “LLM-as-judge” → “4 out of 5”

Step 2: Scorer Categories

Category Examples Speed Accuracy Cost Best For
Statistical BLEU, ROUGE, METEOR, WER +++++ ++ + Specific tasks
Model-based AlignScore/LIM-RA, BLEURT, BARTScore, COMET +++ +++ +++ Domain-specific tasks
LLM-as-judge G-Eval, RARR ++ +++++ +++++ Flexible evaluation
Human evaluation Human evaluation + +++++ ++++++ Calibration & QA

Ideal: Combination of methods. Human evaluation’s primary role: create “answer key” to calibrate automated scorers and final QA. Note: Human evaluation has its own biases.

Step 3: Develop Golden Dataset

Responsible: Product Owner (with Domain Experts and Engineering support)

Product Owner ensures quality, scope, and representativeness. Domain Experts author ideal answers. Engineering provides technical support.

Step 3: Develop Golden Dataset

Source When to Use Notes
Past transaction data Adding AI to existing application Extract question-answer pairs from human-answered queries
Human-annotated data Building new AI offering Generate questions + expert answers. Warning: Don’t use LLM to generate answers for experts to verify - correcting is harder than creating
Customize public datasets High-quality public dataset exists Subset and augment to match your context

Step 3: Develop Golden Dataset

A good dataset should cover:

  • Types of questions: Not just what, but also how (tone, language, code-switching, informality, spelling errors, verbosity)
  • Out-of-context requests: Questions your application doesn’t support
  • Adversarial requests: Abusive input, prompt injection, jailbreaking, data & privacy attacks

Step 4: Running Automated Evals

Responsible: Engineering (with Product Owner support)

Automate evaluations and integrate into CI/CD pipeline. Product Owner monitors performance trends over time.

Step 4: Running Automated Evals

Eval Type Examples Frequency When
Low-Cost Statistical scorers (ROUGE), model-based Every commit Fast feedback, limited scope
High-Cost LLM-as-judge scorers Nightly/weekly/before release Comprehensive but expensive

Tracking: Use observability tools (Logfire, Helicone, Langfuse) to plot metric scores over time. Dashboard helps track progress against rubric goals.

Step 5: Measure Scores and Refine

Responsible: Engineering (with Product Owner support)

Initial scores will reveal areas for improvement. Results are a diagnostic tool, not a final grade. Engineering analyzes results and diagnoses root causes. Product Owner prioritizes refinement work.

Step 5: Measure Scores and Refine

Step Action Purpose
1. Isolate Problem Identify which component is failing Modern AI has many components (retrieval, prompting, model params)
2. Use Traces Inspect inputs/outputs of each component Pinpoint root cause (e.g., ineffective retrieval, poor prompt)
3. Unit Tests Implement component-level tests Validate specific logic, catch regressions early

Goal: Turn evaluation into a process that helps teams ask better questions and improve iteratively.

Step 6: Red-Teaming

Responsible: Product Owner (with Engineering, QA, or Security/Ethics Team support)

Product Owner ensures red-teaming is conducted and prioritizes remediation. Technical teams execute adversarial testing.

What is red-teaming? Structured adversarial testing to proactively discover vulnerabilities, biases, and failure modes before users do. Think like a malicious actor, confused user, or edge-case generator.

Step 6: When is Red-Teaming Critical?

Scenario Why Critical
Agentic/Flexible Solutions More pathways for failure (web browsing, code execution, multi-step decisions)
Long Conversation Histories Cumulative errors - small issue in turn 1 amplified by turn 10
High-Risk Domains Maternal health, medical advice, financial planning - severe impact of failure
Population-Scale Unknown interaction patterns; improbable behaviors will occur at scale

Step 6: Red-Teaming Process

Level 2: Product Evaluation

Level 2: Product Evaluation

Question: Does the product facilitate meaningful interactions?

Why important: Beyond evaluating how the AI model performs against key metrics, organizations need to assess how well the product engages real users and whether it solves a meaningful problem for the user. It is unlikely that a product will shift development outcomes if it fails to engage its users. Like model evaluation, this type of evaluation is a continuous and iterative process, rather than one-off.

Technology companies frequently evaluate and improve products by collecting user interaction metrics and then running rapid cycles of digital experiments. For example, they may track a user’s journey on a website, automatically collecting records like which products users click on and whether they return to the site. Then, they can compare how different web or app experiences affect browsing time or user satisfaction.

Unique advantages of digital: This rapid, iterative process is enabled by two factors unique to digital interventions: (1) iterations of the product can be precisely and efficiently deployed to different users, and (2) on-platform engagement outcomes are costless to collect and transform into meaningful engagement metrics.

Who is Most Involved?

Role Responsibility
Product Managers Execute - Directly responsible for product metrics
Data Scientists Support - Apply evaluation methods
Engineers Support - Build and roll out features

Evaluation Methods

Measurement Tools: Categories

Category Metric Type Examples
Retention User-Level Retention DAU/MAU, session count
Engagement Action-Based Response rate, clicks, rewrites
Engagement Interaction Duration Session length, conversational turns
Engagement Feature Uptake Click-through to links, feature use
Non-Engagement Quality Scores Toxicity score, informativeness
Non-Engagement Item-Level Surveys “Helpful” ratings, “want more” votes
Non-Engagement User-Level Surveys Overall satisfaction, usability
Non-Engagement User Control Topic subscriptions, filtering

Level 3: User Evaluation

Level 3: User Evaluation

Question: Does the product positively support users’ thoughts, feelings, knowledge, and behaviors?

Why important: Once product functions correctly (Level 1) and engages users (Level 2), ask:

Is it changing how users think, feel, or act?

User psychological and behavioral changes are early indicators of long-term development goals. These evaluations are faster and cheaper than full impact evaluations, allowing rapid iteration.

Level 3: Focus Areas

Area Question Example Constructs
Cognitive Are users gaining new knowledge or correcting misconceptions? Do they demonstrate improved skills or decision-making ability as a result of using the AI? Users’ comprehension, reflection, reasoning, and perceived clarity or understanding during interaction
Affective How does the product make users feel? Do users report feeling supported, motivated, and capable after interactions, or are there indications of frustration, confusion, or emotional distress? Mood, sense of belonging, perceived empathy, trust, or comfort interacting with AI
Behavioral Are users taking small but meaningful actions (e.g., asking more questions, trying out recommended behaviors) that would predict their longer-term development? Users’ acquisition, recall, and application of factual or procedural information, and observable behaviors (e.g., asking more questions, trying out recommended behaviors) that are proxies for longer-term development outcomes

Who is Most Involved?

Role Responsibility
Psychologists, UX Researchers Execute - Apply evaluation methods
Data Scientists Support - Design A/B tests and experiments

Evaluation Methods

  • A/B testing: Feature A vs. Feature B
  • Multi-armed bandit: Performance-based adaptive allocations
  • Holdout testing: AI vs. non-AI

Measurement Tools: 4 Categories

  1. On-Platform Behavioral Measures:

    • Frequency and depth of queries
    • Changes in language complexity
    • Follow-up question rate
    • Session duration & return rate
    • Feature utilization and AI suggestions followed
  2. Short Self-Report Surveys: Validated scales, brief and specific, integrated into flow

  3. NLP and Text Analysis: Sentiment analysis, topic modeling, LIWC, LLM-based analysis

  4. Off-Platform Measures: Longer surveys, observer reports, objective performance data

Level 4: Impact Evaluation

Level 4: Impact Evaluation

Question: Does the product improve development outcomes?

Why important: IEs measure effects on outcomes like mortality, learning, and earnings. The challenge: many things happen simultaneously, making simple before-and-after comparisons unreliable.

Solution: Use a counterfactual - a similar sample that didn’t receive the intervention. This captures what would have happened without the intervention, allowing us to isolate the intervention’s impact.

Level 4: Counterfactual Methods

Method Description Best For
RCT Random assignment to treatment/control Most credible; gold standard
Propensity Score Matching Match on observable characteristics When randomization not possible
Difference-in-Differences Compare trends before/after When parallel trends assumption holds
Regression Discontinuity Compare units just above/below cutoff When cutoff exists and is exogenous

RCTs are the most credible way to determine causal impact. Random assignment ensures differences can be attributed to the intervention, not population differences or external factors.

Who is Most Involved?

Role Responsibility
Policy Researchers, Economists Execute - Apply evaluation methods
AI Engineers Support - Ensure product functions as expected

Why Do an Impact Evaluation?

There are 3 main reasons to do an impact evaluation:

  1. Proof of concept: By isolating the effect of the intervention from the rest of the world, the impact evaluation allows you to causally attribute changes in outcomes to the intervention - giving you proof of concept.

  2. Proof in different settings: Once you know it works in a particular setting, with a particular target population, you may want to show it will work in other settings or for other populations - then you can do additional impact evaluations.

  3. Cost-benefit analysis: For many funders and public sector partners, IEs are critical for decision-making. They want credible evidence that a product meaningfully improves people’s lives - beyond engagement metrics or self-reported satisfaction - before committing to scale. A well-designed IE sends a strong signal that your product works in real-world conditions, and that scaling it is likely to generate meaningful social returns (see e.g. Hauser et al., 2025; UK GOV, 2025).

IEs also help funders compare across opportunities. When paired with cost data, they allow for robust estimates of cost-effectiveness and cost-benefit analysis - crucial when governments, donors, and multilateral institutions are allocating scarce resources. In many cases, the result of an IE becomes a key input in decisions to scale, replicate, or exit.

Important: It is important to be clear on why you are doing the impact evaluation at the outset, as this will affect the data you collect and how you design the evaluation. For example, if you are doing a proof of concept evaluation, you may want to invest more in collecting a rich set of outcomes in your Level 3 evaluations to understand how these map to ultimate welfare outcomes.

When is it Appropriate?

IEs are high-investment undertakings, both financially (they often cost millions of dollars) and operationally (service providers have to adapt their operations, often in challenging ways, to make them work). They are most useful when your product is mature enough to test and when the decision stakes are high enough to justify the effort.

In general, consider an IE when:

  • Levels 1-3 are strong: The model performs well, users engage meaningfully, and early evidence suggests improvements in knowledge, attitudes, or behavior.
  • You are preparing to scale: Funders or policymakers are considering wider adoption, but want evidence, including cost-effectiveness or cost-benefit estimates, to support the decision.
  • You have bandwidth: Implementing an IE is a lot of work for both the research team and implementer; doing it well takes time and effort.
  • You are confident your product works: The world may be uncertain that your product has a meaningful impact, but you shouldn’t be. Your earlier stage evaluations should give you confident priors that you’ll find effects on development outcomes.

When is it Appropriate?

You do NOT need to run an IE if your product is still in early design or if usage is inconsistent to the point where you are worried about impacts. In those cases, Level 3 evaluations can be more appropriate.

Plan for Evaluability Early

Although impact evaluations are typically conducted at later stages, designing credible and cost-effective IEs often requires thinking about design decisions far earlier in the process. Incorporating features like holdout groups, staged rollouts, or embedded randomization into the initial product architecture (which could also be useful for A/B tests) ensures that rigorous causal evaluation remains possible - without requiring disruptive redesigns later on.

Even if a full IE is not yet justified, these design choices create structured opportunities for credible inference when the time comes and can significantly reduce the burden of evaluation. Funders assessing scale readiness should look for these signals of early evaluability.

How to Do an IE Responsibly

Rigorous IEs require expertise. We recommend working with an independent evaluator - such as an academic partner, a research NGO (e.g., J-PAL, IPA), or a third-party M&E firm. This enhances both the technical quality and the perceived credibility/independence of your evaluation.

At a minimum, we suggest:

  • Clarifying roles: Who builds the product, who runs the study, who communicates findings
  • Pre-registering the design: On platforms such as the AEA RCT Registry, EGAP, or RIDIE
  • Sharing results transparently: Disclose all findings, including null or negative results, and make methods and materials publicly available where feasible to support reproducibility and sector-wide learning

Impact Evaluation Methods

  1. Randomized Control Trials: Random assignment to treatment/control
  2. Propensity Score Matching: Match on observable characteristics
  3. Difference-in-Differences: Compare trends before/after intervention
  4. Regression Discontinuity Design: Compare units just above/below cutoff

Resources: Impact Evaluation in Practice (Gertler et al., World Bank), Running Randomized Evaluations (Glennerster & Takavarasha)

Focus: What is distinctive when evaluating AI-based products

Key Design Consideration 1: Counterfactual Selection

Options:

  • Pure control: No intervention at all

  • Business-as-usual: No digital support or sporadic human guidance

  • Non-AI digital tools: Static chatbots or curated content

  • Human-delivered services: When AI substitutes for scarce labor (measure costs: Cost Measurement Guide)

Important: Justify selection and explain what it helps illuminate

Key Design Consideration 2: Measuring Latent Access

Critical: Marginal benefit depends on what other support users already have

Measure:

  • Existing technology use (frequency, type, purpose)

  • What users rely on today (informal networks, human advisors, basic tech)

  • Leakage - how much control group has access to intervention

Why: Shapes incremental value added by AI product

Key Design Consideration 3: Managing Product Dynamism

Challenge: SUTVA assumption (same version for all treated units) often violated

AI products: Designed to improve iteratively - different participants may interact with different versions

Solutions:

  • Tag your versions

  • If A/B testing, randomize test participation

  • Maintain hold-out group on baseline version

-Pre-specify at high level (not overly detailed)

Key Design Consideration 4: Measuring True Development Outcomes

Challenge: AI tools often simulate expertise - does user learn or just copy?

Solutions:

  • Use industry-standard validated assessments

  • Use administrative data

  • Avoid measures that can be gamed by repeating AI output

  • Test ability when users don’t have access to AI

Key Design Consideration 5: Managing Spillovers and Contamination

Challenge: AI tools designed for scale - freely accessible, easy to share

Strategies:

  • Controlled access: Individual or cluster assignment

  • Publicly available: Randomized encouragement design

  • High contamination risk: Run in settings with low existing exposure

  • Cluster randomization: By school or clinic

  • Monitor usage: Be prepared to adjust power calculations

Common Pitfalls to Avoid

  1. Being underpowered: Test uptake with small groups, bring pessimists into planning
  2. Mismanaging transparency: Balance rigor and responsiveness
  3. Letting product evolution obscure analysis: Pre-specify how changes handled analytically
  4. Underestimating attrition risks: Track engagement from start, plan for dropouts

Summary: The Four-Level Framework

Level 1 - Model: Does the AI model produce desired responses? - Rubric → Metrics → Golden Dataset → Automated Evals → Refine → Red-team

Level 2 - Product: Does the product facilitate meaningful interactions? - A/B tests, engagement metrics, retention, feature uptake

Level 3 - User: Does product support thoughts, feelings, knowledge, behaviors? - Cognitive, affective, behavioral outcomes via surveys, NLP, behavioral measures

Level 4 - Impact: Does access improve development outcomes? - RCTs and other methods to measure causal impact on mortality, learning, earnings

Key Principles

  1. Continuous evaluation: Rapid, ongoing cycles, not one-off experiments
  2. Cross-functional teams: All four perspectives must be represented
  3. Building blocks: User funnels, ETL pipelines, targeted hypotheses, experiments
  4. Complementary methods: All levels work together at different stages
  5. Right-sized evaluation: Make informed tradeoffs about what’s “enough”

Additional Resources

Questions?